Predicting Exoplanet Discoverability Based on Planetary Features

Phase 2: Statistical Modelling

Group Name: 40

Name(s) & ID(s) of Group Members:

Rafat Mahiuddin (s3897093)

Adrian Rebellato (s3889401)

Arthul George (s3918048)

Table of Contents

Introduction

Phase 1 Summary

Phase 1 required us to process our data and bring it up to the standard required for multiple linear regression modelling. We removed unnecessary columns, calculated new columns using data inferred from existing features, found outliers, and dropped rows with missing values. We were left with almost 3000 rows, with no missing or unusual data.

Our goal is to explore what factors influence planet discoverability by creating a Multiple Linear Regression model to predict planet radius. To facilitate this, we made two assumptions at the beginning of the study: that the distribution of exoplanets is independent of their distance from Earth, and that the radius of a planet is correlated with ease of discoverability. Our exploration suggests that the latter is true. We can also assume the former is true (until proven otherwise), as it is the simplest and most widely accepted description of our universe among astrophysicists.

In Phase 1, exploration into the relationships between features revealed a strong link between orbital distance and orbital period, between distance from Earth and parallax, and between planet mass and radius. We also discovered that the number of exoplanets per star system dropped to 1 beyond 2500 parsecs, which further supports our assumptions. Finally, exploring the positional relationships between exoplanets revealed the Kepler mission and how it dominated our dataset.

These relationships have helped inform the pre-processing we will conduct before creating the MLR model in Phase 2.

Report Overview

The goal of this report is to predict exoplanet radius based on certain astronomical features. We decided to use both MLR and a deep neural network (DNN) as our modelling techniques. We will also compare both models to determine which predicts our target feature more accurately.

We conduct one-hot encoding for our categorical variables and normalise our numeric features, then build a full MLR model. After performing our diagnostic checks on the MLR model, we conclude that our residual distribution is bimodal, which contributes to a weaker MLR. We then use backwards feature selection to build a reduced MLR using 13 variables, which achieves an R-squared value similar to that of our full model: 0.43.

Concluding that our prediction accuracy could be improved upon, the report goes on to outline the creation of a deep neural network. This begins with dedicated data processing for our DNN, followed by fine tuning of its hyperparameters via various fine tuning plots. As a result, we achieve a significant improvement in accuracy when predicting our target feature. This is discussed further in the DNN Discussion (Literature) subchapter.

The Phase 2 report outlines the additional data processing required for multiple linear regression, full and reduced MLR fitting, and diagnostic checks for MLR modelling. Additionally, this report contains data processing for deep neural networks, neural network fitting, fine tuning plots, a neural network discussion, critiques / limitations, and a summary of our findings.

Overview of Methodology

We explore our data using both Multiple Linear Regression and Neural Networks to predict the value of planet radius, in order to explore exoplanet discoverability.

Multiple Linear Regression is a model that relies on multiple explanatory variables to predict a target feature. Below you can find the full model using all of our explanatory variables, as well as a reduced model which has utilized backwards feature selection to cull weaker explanatory variables.

The Multiple Linear Regression model predicts planet radius with an R-squared value of about 0.4. This is not a strong prediction, so to develop a stronger model we used a deep neural network, explained further below.

Data Processing

Module Imports

One Hot Encoding

Data Normalisation
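The two preprocessing steps above (one-hot encoding of categoricals, then MinMax normalisation of numerics) can be sketched as follows. This is a minimal sketch assuming pandas and scikit-learn; the column names here are hypothetical stand-ins, not the actual dataset columns.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Illustrative data; column names are hypothetical stand-ins for our features
df = pd.DataFrame({
    "discovery_method": ["Transit", "Radial Velocity", "Transit"],
    "planet_radius": [1.2, 11.0, 2.5],
    "orbital_period": [12.3, 400.0, 88.0],
})

# One-hot encode the categorical variable into separate 0/1 columns
df = pd.get_dummies(df, columns=["discovery_method"])

# Normalise all numeric features to the [0, 1] range
scaler = MinMaxScaler()
df[["planet_radius", "orbital_period"]] = scaler.fit_transform(
    df[["planet_radius", "orbital_period"]]
)
```

The same fitted scaler can later be reused so that MLR and DNN results sit on the same scale.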

Statistical Modelling

Model Overview

Our first attempt at modelling our data will be a Multiple Linear Regression model using all of our explanatory variables.

Here are the variables we will be using for this particular model:

Model Fitting

Feature Selection

Formula String

OLS model to encoded data
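Fitting an OLS model to the encoded data might look like the following sketch, assuming statsmodels and a DataFrame of normalised features. The data and column names here are illustrative, not the notebook's actual variables.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Illustrative encoded data; in practice the formula string is built from
# all explanatory variables remaining after preprocessing
rng = np.random.default_rng(0)
encoded = pd.DataFrame({
    "planet_radius": rng.random(100),
    "orbital_period": rng.random(100),
    "star_mass": rng.random(100),
})

# Build the formula string from every explanatory column, then fit OLS
predictors = [c for c in encoded.columns if c != "planet_radius"]
formula = "planet_radius ~ " + " + ".join(predictors)
model = smf.ols(formula=formula, data=encoded).fit()
print(model.summary())
```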

Visualizing the accuracy of our model by plotting actual radius vs. predicted radius

Full Model Diagnostics Check

We would like to check whether there are indications of violations of the regression assumptions, which are:

  1. Linearity of the relationship between target variable and the independent variables.
  2. Constant variance of the errors.
  3. Normality of the residual distribution.
  4. Statistical independence of the residuals.

From this plot we see that the residuals exhibit a banding pattern, especially when the predicted radius is below 10. The impact of the Kepler mission (as explored in Phase 1) can also be seen in this plot. The majority of data points which make up the left-most hotspot appear to be over-estimated. Based on our previous exploration, we can assume that these data points are from the Kepler mission, which had greater success finding smaller, Earth-like exoplanets.

The shape of this plot shows that the model over-estimates the radius of small exoplanets.

Visible in this histogram is how our original dataset has two peaks: one from Kepler, and the right-most peak from all other missions. Our model seems to pick the middle ground, which causes significant inaccuracy.

This violates the normality assumption for our residual distribution, which may cause our MLR model to be significantly weaker than expected.

Performing backwards feature selection (credit).

Calculating Residuals for model visualization


Deep Neural Networks (DNN)

Import Statements

Data Reprocessing for DNN

To make our DNN more precise, we will have to further control the amount of data that is passed through our neural network. This involves dropping features that do not contribute significantly to our target. Please note that the results for both the MLR and DNN have been scaled equally to allow for a fair comparison. This will become obvious in the Fine Tuning Plots subchapter.

Outlier Processing for DNN

Contrary to our initial beliefs, the DNN failed to perform appropriately when we filtered out outliers. In some cases, our NN performed so poorly it was almost impossible to graph. This is predominantly because the outlier check removed two thirds of our dataset, leaving only highly biased data from the Kepler mission. The lack of variety, far too many columns, and a much smaller dataset led to our NN making constant, obvious (and uninteresting) predictions. To tackle our outliers, we instead chose an outlier-friendly optimizer, which is expanded upon further in the literature section.

Apply Feature Selection and Column Processing for DNN

The following code reduces the number of unnecessary columns both by feature selection (based on p-values) and manual selection, making our NN far more accurate.

Data Partition

Partition the dataset into both training and testing datasets.
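A sketch of this partition using scikit-learn's `train_test_split` (an 80/20 split is assumed here for illustration; the array shapes and variable names are hypothetical):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Illustrative feature matrix and target; in practice these come from the
# processed DataFrame, with planet radius as the target
rng = np.random.default_rng(2)
X = rng.random((100, 6))    # 6 explanatory features, as in our DNN
y = rng.random(100)         # target: normalised planet radius

# Hold out a test set; the remainder is used for training (and validation)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```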

Target Feature Partition

Data Normalisation

Plotting Functions

Functions for plotting graphs later on.
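One such helper might be sketched as follows, assuming matplotlib and a Keras-style `history.history` dictionary containing `loss` and `val_loss` lists (the function name is illustrative):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; safe outside notebooks
import matplotlib.pyplot as plt

def plot_loss(history_dict, title="Loss Analysis"):
    """Plot training loss vs. validation loss per epoch."""
    fig, ax = plt.subplots()
    ax.plot(history_dict["loss"], label="loss")
    ax.plot(history_dict["val_loss"], label="val_loss")
    ax.set_xlabel("Epoch")
    ax.set_ylabel("Loss (MSE)")
    ax.set_title(title)
    ax.legend()
    return fig
```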

DNN Model Build and Compilation

Builds and compiles the model structure. Further discussed in literature.
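The structure described by the hyperparameters listed later (6 inputs, two 24-node ReLU hidden layers with dropouts of 0.015 and 0.0, a single linear output, Adam optimizer) can be sketched in Keras as follows. This is a sketch of that topology, not the exact notebook code.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_model():
    """Build and compile the DNN described in the hyperparameters section."""
    model = keras.Sequential([
        keras.Input(shape=(6,)),               # 6 input features
        layers.Dense(24, activation="relu"),   # hidden layer 1
        layers.Dropout(0.015),                 # hidden layer 1 dropout
        layers.Dense(24, activation="relu"),   # hidden layer 2
        layers.Dropout(0.0),                   # hidden layer 2 dropout (none)
        layers.Dense(1, activation="linear"),  # single-node linear output
    ])
    model.compile(optimizer="adam", loss="mse", metrics=["mse"])
    return model
```

With 6 input nodes, two 24-node hidden layers, and 1 output node, this topology matches the 55 nodes and 744 edges quoted in the structure section.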

DNN Model Training

Train the model based off the training data. Further discussed in literature.
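Assuming a compiled Keras model and the partitioned training data, training with the quoted 80/20 validation split and 150 epochs might look like this sketch (the model and data below are illustrative placeholders):

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Placeholder compiled model and training data; in the notebook these come
# from the build step and the data partition step
model = keras.Sequential([
    keras.Input(shape=(6,)),
    layers.Dense(24, activation="relu"),
    layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")

rng = np.random.default_rng(3)
X_train, y_train = rng.random((200, 6)), rng.random(200)

# 80/20 training/validation split, 150 epochs
history = model.fit(
    X_train, y_train,
    validation_split=0.2,
    epochs=150,
    verbose=0,
)
```

The returned `history` object holds the per-epoch `loss` and `val_loss` values used by the fine tuning plots.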

Fine Tuning Plots

Key plots that denote model performance, which are then used to fine tune our network. Even though this process is discussed in our literature section, short summaries underneath each graph highlight the intention behind every plot.

Loss Analysis

Figure 14 demonstrates the loss analysis graph for our current DNN. The graph shows the prediction error for our target feature, planet radius, as the network gradually learns (over epochs). As a result, loss analysis is key to determining the overall learning performance of a neural network. To elaborate, loss is the value the neural network is trying to minimise over time; a lower loss value is indicative of more accurate predictions. Consequently, the DNN learns by readjusting its nodes' weights and biases in a manner which reduces the loss. val_loss is computed on the validation set, and is a good indication of how the network will perform on data it has never seen before. (1)

Our loss analysis graph indicates good learning progress for our DNN. Firstly, both loss curves take the shape of an inverse logarithmic curve, demonstrating strong learning performance as the epochs progress. Secondly, the error values for both curves can be considered quite low (in terms of astronomical predictions), highlighting high accuracy when predicting our target feature. Finally, our loss and val_loss curves are very tightly packed together. This illustrates our DNN learning the data appropriately, rather than just memorising the training data or failing to see any connections within the dataset. Despite these positives, there is a minor issue of the curves becoming quite jagged, indicating a lack of confidence in our DNN predictions. This, however, may be a limitation of the dataset we are working with (see the discussion subchapter to understand how this was reduced).

Overall, Fig. 14 displays the strong learning progress of our DNN as the epochs progress. However, the curves do come out slightly jagged, which may show our DNN struggling to learn at certain stages.

MSE per Epoch

Figure 15 denotes the mean squared error (MSE) as the epochs increase for our DNN. "MSE is an absolute measure of how well the model fits" (2), and is given by the following formula:

MSE = (1/n) · Σ (y_i − ŷ_i)², summed over all n observations, where y_i is the actual value and ŷ_i is the predicted value.

As such, Fig. 15 closely mirrors the shape of Fig. 14, confirming the strong DNN performance over both training and test data. Our MSE graph signifies a very high accuracy rate for our current DNN model. The low MSE values indicate how little the predicted results (in both training and test) deviate from the actual values. However, the slightly jagged curves do highlight our model struggling to learn in certain cases (as mentioned previously). Again, this is likely a limitation of the dataset that was used.

Data Output

Output MSE data.

DNN Model Prediction

Create a prediction based on our test data.

Same Scale Normalisation for MLR Results

Normalise the MLR data to the same scale so accurate comparisons can be made.

Key Statistics

Figure 16 displays key statistics for both our MLR and DNN models. We will go through them one by one.

  1. The first highlights the standard deviation (std) for both models. This was predominantly used as a sanity check during fine tuning to see whether our DNN produced nonsense. From the above graph, we can see that our DNN has a std similar to that of our MLR.
  2. Our second statistic is the mean. As mentioned in the previous point, this is primarily used as a check during fine tuning. Figure 16 shows a notable increase in mean for our MLR when compared to the DNN. This, however, is due to our MLR over-predicting a good portion of its results (as shown in the next graph), skewing its mean to the right. At the same time, this highlights our DNN as far more accurate than our MLR.
  3. Our final statistic is a measure of the area beneath the bell curve (between x = −0.5 and x = 0.5) for both models. In other words, it illustrates the tightness of each model's normalised bell curve (and how accurate its predictions are). From the above, we can see that our MLR has a far greater area (is looser around the mean) compared to our DNN model (more tightly packed towards the mean), indicating that our DNN performs significantly better than our MLR model (as shown in the next few plots). This was especially useful in fine tuning our DNN to yield better results than our MLR.

Prediction Performance Analysis

Figure 17 demonstrates the prediction performance for both of our models. Here, we will call the black line the reference line; the green is our MLR while the red is our DNN. The y-axis represents predictions made by the model, while the x-axis indicates the true value (essentially, the closer a point is to the reference, the better the model's prediction performance). Notice how the scale is between 0 and 1, meaning 0.03 - 0.10 most likely represents planets around 3/4 to 4 times the size of Earth. Each dot represents the model predicting an exoplanet exists with its corresponding estimated radius.

From both plots, it is evident that our DNN outperforms our MLR by a large margin. Our MLR model is seen to over-predict, while making predictions over a larger range when compared to our DNN. This is further supported by our KDE plots: irregular rings on both models depict each model's primary prediction fields. In Fig. 17b, our DNN makes predictions centred around the reference, while our MLR places its fields above the reference. Interestingly, our DNN manages to recognise patterns while predicting our outliers (in astronomy, and given our dataset, the ability to predict larger, harder-to-detect planets makes our model more reliable and accurate).

Fig. 17a better demonstrates larger planet predictions for both models. Compared to our DNN, the MLR makes frequent large-planet predictions despite such planets being quite rare in our dataset. In stark contrast, our DNN focuses its predictions on the Kepler cluster, while still making rare predictions for larger planets. However, both models clearly struggle with such predictions. Fig. 17a highlights an issue in which the models produce radical predictions for larger planets. While most predictions of large planets are fairly accurate, this is a limitation of the dataset that was used, which rarely captured extremely large exoplanets.

Overall, the above plots provide strong evidence of our DNN outperforming our MLR. To our surprise, the DNN seems to make fairly accurate predictions for most of the outlier exoplanets in our dataset; however, it does tend to underestimate as the target feature increases.

Residual Performance Analysis

Figure 18 displays the residual performance for both prediction models. The blue curve indicates our DNN while the orange is our MLR.

The above clearly depicts our DNN containing a high frequency of near-zero residuals, demonstrating excellent residual performance. This is in direct contrast to the MLR, which shows a wider residual curve, often indicative of lower accuracy. It is also interesting to note the tails of both distributions. Our DNN curve has an inflated tail on the left, which may indicate underestimation for larger planets. Meanwhile, the MLR has a noticeable bump on its right tail, which explains the extreme overestimation shown in Fig. 17b.

Overall, the residual performance graph signifies our fine-tuned DNN having much more accurate performance when compared to our MLR.

DNN Discussion (Literature)

Overview

This discussion will aim to inform the reader about the neural network that was used, its structure, the fine tuning process and other relevant decisions. Performance and limitations of the network will also be discussed.

Hyperparameters

Our deep neural network (DNN) made use of the following hyperparameters.

  1. Number of input nodes: 6

  2. Hidden layer:

    • Number of hidden layers: 2
    • Number of nodes (neurons): 24
    • Activation type (both layers): Rectified Linear Activation Unit
    • Hidden layer 1 dropout: 0.015
    • Hidden layer 2 dropout: 0.000
  3. Output layer: Number of nodes: 1

    • Activation: Linear
  4. Compilation process:

    • Optimizer: Adaptive moment estimation (ADAM)
  5. Model training:

    • Validation Split: 80 / 20 (%) → Training / Validation
    • Epochs: 150

The decisions behind these parameters will be discussed in the process section.

DNN Structure

Our network contains a total of 55 nodes and 744 edges. Here is the structure:

image.png

The Process Behind our DNN structure

As seen above, our DNN contains 6 input nodes followed by two hidden layers. Initially, we passed in 13 features, making our network quite complex. This took a toll on accuracy and the network began to make quite radical predictions, which resulted in harsher, jagged lines in our loss graph, as shown below:

image.png

After this, we realised we were passing in columns that had little bearing on our target feature, so we decided to prune the number of features we passed in to 6. This drastically stabilised our loss graph:

image.png

Adding dropouts between hidden layers also helped in producing more reliable predictions. However, a high dropout number resulted in the two lines merging together and warping in unexpected ways (for example, they would become more jagged). This led us to a value of 0.015, which gave us better prediction performance.

In terms of the amount of hidden layers, we initially experimented with 3 layers:

image.png

And even 5:

image.png

Even though we achieved a very good learning rate using 3 (or even 5) layers, it overcomplicated the network by a large factor. First of all, the network was not learning the data but simply becoming very good at memorising the training data. This led to very distant endpoints for both of our curves, which we deemed undesirable. We decided on 2 hidden layers as this simplified the network (reduced jagged curves) and allowed our val_loss and loss curves to stick together more tightly. This resulted in our DNN performing better on unseen data when compared to a more complex model.

For the number of nodes, we initially tested 64 nodes. After further experimentation, we doubled that value to 128. When simplifying our network, however, we decided that an increase in nodes negatively affected the stability of the entire network and resulted in greater variance in predictions. After adjustments, we found 24 nodes gave us a good middle ground between stability and prediction performance.

As our problem is a linear regression, we used a single-node output with a linear activation function. The hidden layers used ReLU as their activation function, which allowed the network to better learn the complex relationships within our data given our problem type. (3)

During the compilation process, an Adam optimizer was chosen over SGD. While SGD worked desirably with our dataset, we found Adam to optimize the data in a more predictable fashion. As "[Adam] combines the advantages of [...] AdaGrad to deal with sparse gradients, and the ability of RMSProp to deal with non-stationary objectives" (4), we found Adam to better predict larger planets (outliers) more precisely when compared to SGD. It also improved the overall prediction performance of our NN by predicting exoplanets closer to the reference line in fig. 17.

Epochs played a great role in the way in which our dnn learned. A high number of 500 epochs was initially tested and yielded the following results:

image.png

Such a high epoch count encouraged our network to memorise the training data rather than learn from it. This explains why the network progressively performs worse on data it has not seen before (the error rate increases) compared to data it has learned from (the error rate decreases). After experimenting with 100 epochs, we realised the learning was cut off too early and therefore settled on 150 epochs. This number consistently produced a reliable loss graph over multiple trials.

This concludes the discussion portion of our deep neural network. Overall, we discussed the hyperparameters used, the DNN topology and structure, and the process behind tuning our DNN. Additional tuning graphs were provided to complement the discussion.


Critique & Limitations

The primary underlying flaw of this two-phased report lies within our dataset. Historically, large clusters of planets are discovered via missions that span a specific period of time using a specific instrument. One of the instruments used to create our dataset was the Kepler Space Telescope. Kepler's primary mission was to discover Earth-like planets that may harbour extraterrestrial life. As a result, its discoveries were skewed towards smaller, Earth-like planets rather than larger planets (which are often uninhabitable). When graphing our target feature, this generated two dissimilar peaks that both of our models had trouble learning across. Our DNN performed noticeably better than our MLR in such cases.

It is also worth mentioning that our dataset may not be truly representative of all exoplanets in the universe. Given Kepler's mission and how discoverability depends on certain features, our dataset may mostly contain exoplanets that were more easily discoverable. Rogue planets with no star, or planets orbiting a very dim or smaller star, may actually be far more common in the universe than the regular planets within our dataset; their existence is simply harder to record. These planets may therefore seem much rarer to us even if they are more common in the universe.

Unfortunately, our MLR model possesses some notable weaknesses in terms of prediction accuracy. As mentioned before, our dataset contains two irregular peaks corresponding to smaller planets and larger exoplanets. Our MLR model tries to merge these two peaks into a single peak, and in the process creates a large number of predictions that fit neither peak well. This is illustrated by Figure 11. The residuals for our MLR also fail to obey the four main regression assumptions. Firstly, as our MLR is troubled by our two-peaked dataset, the residuals do not produce a normal distribution. Secondly, the residuals are inconsistent (in terms of their variance) as planet radius grows. However, where the MLR has its weaknesses, our DNN proves its strengths.

In terms of strengths, our DNN model can make predictions for our target with high accuracy. Unlike our MLR model, the DNN is not disturbed by the two different peaks in our dataset and accurately recognises trends, especially for smaller exoplanets (Fig. 17). While our model sometimes underestimates the radius of larger planets, it still mostly makes fairly accurate predictions. Another notable strength is the clustering of its predictions: our DNN prefers to cluster its predictions around the reference line, making them far more accurate in comparison to the massive spread of predictions given by the MLR.

One advantage of our MLR model is that it tends to produce a more linear prediction field for larger values (Fig. 17a) when compared to our DNN. While our DNN is quite competitive (even for larger exoplanets), our MLR may actually perform better when faced with larger exoplanet predictions.

Summary & Conclusions

Project Summary

Phase 1

The summary for Phase 1 will be described in the following steps:

  1. We first selected the type of dataset we wanted to work with and then decided to build a custom dataset using NASA's Exoplanet Archive database.
  2. Afterwards, we decided on our target feature, goals, and objectives. We initially made our target feature the number of planets in a single system, but soon realised that this data was in fact nominal categorical. Planet radius was then selected as our target feature, as other crucial properties (such as mass and orbital distance) may be derived from a planet's radius. Our objective became predicting the size of discovered exoplanets based on features of their solar systems. From this we can extrapolate linearly correlated features to pursue our goal of finding features that influence the likelihood of planet discovery.
  3. We then cleaned and processed the data for Phase 2.
    • Unnecessary columns were removed to make our dataset more focused and suitable for DNN and MLR models.
    • Columns were renamed to easy-to-read names.
    • Additional columns mass_ratio_sys and radius_ratio_sys were calculated.
    • All NaN values were dropped.
  4. Our data was then explored through a number of different graphs. Some notable ones included:
    • Planet discoveries based upon their orbital period and orbital radius
    • Mass-to-radius relation for discovered planets
    • Exoplanet projected location in relation to the number of planets discovered in its system
  5. The processed Phase 1 data was then exported for use in Phase 2.

Phase 2

The summary for Phase 2 will be described in the following steps:

  1. Categorical variables were one-hot encoded into separate columns.
  2. We then used MinMaxScaler to normalise all numeric types in our dataset.
  3. Feature selection was performed:
    • A formula string was created
    • Our model was then fitted to create a multiple linear regression (MLR) model
    • Plots were created to measure the prediction performance of the MLR
    • The accuracy of our model was then determined by calculating and plotting residuals
    • A further full-model diagnostic check was performed to determine whether the model is healthy
  4. Backwards feature selection was performed to remove features with a p-value greater than 0.05.
    • Prediction performance after backwards feature selection was plotted to analyse any improvements (or lack thereof).
  5. A Deep Neural Network was generated over the same dataset and target feature.
    • Specific data processing for the NN was conducted (for example, dropping additional columns to increase accuracy).
    • The dataset was partitioned into training and testing datasets
    • The NN model structure was built and compiled with appropriate parameters
    • The DNN was then trained on the training dataset
    • A combination of 5 graphs was used to fine tune the neural network and assess its performance relative to our MLR.
    • A discussion followed, informing the reader about the DNN used, its structure, the process by which the model was tuned, and the hyperparameters used.
  6. Limitations of our dataset were explored, followed by a write-up on the strengths and weaknesses of our current models.

Summary of Findings

The R-squared value for our reduced model with 13 variables is 0.427. In this moderate-strength model, semi_major_axis, orbital_period, star_mass, star_radius and distance play the largest roles in predicting planet_radius. We predicted that as distance increases, planet radius would increase, but our findings contradict this. We do not know whether this would be the case if our dataset were unimodal. semi_major_axis (the average distance of an exoplanet from its star) was the strongest predictor. Star radius and star mass were also highly weighted, which matches our prediction that larger stars would provide more detectable exoplanets.

Diagnostic checks indicated that there was slight banding for planets below a radius of 10, and a bimodal residual distribution. This may be the cause of the MLR's lack of strength. The significantly higher accuracy of our neural network suggests that there are much better ways of predicting planet_radius than MLR with our chosen dataset features. The neural network accurately estimates the radius of exoplanets discovered by Kepler, whereas the MLR model overestimates these planets.

The residuals of the neural network exhibit a much more normal, and tighter distribution around 0, indicating that the neural network much more accurately predicts planet_radius. The connections within the neural network are not easily parsable, so while we cannot draw direct conclusions from these relationships like with MLR, we still have a highly accurate model.

Conclusions

Our objective was to predict the radius of a discovered exoplanet, and explore the features which most strongly affect this prediction. Due to the nature of our dataset, our MLR model was not accurate enough to fully explore this relationship, however some features were highly significant. We discovered significant relationships between planet radius, orbital distance and star size, despite the difficulties of the bimodal model.

We can accurately predict planet radius using the Deep Neural Network to a much higher degree of accuracy compared to our reduced MLR model. This model has also been confirmed to not be overtrained to our dataset, and as such it may be suitable for new exoplanet data, and for predicting the presence of exoplanets in unsurveyed systems.

References

  1. Gervais, N. (2017, November 15). How to understand loss acc val_loss val_acc in Keras model fitting. Stack Overflow. Retrieved October 20, 2021, from https://stackoverflow.com/questions/47299624/how-to-understand-loss-acc-val-loss-val-acc-in-keras-model-fitting

  2. Wu, S. (2021, June 5). 3 Best metrics to evaluate Regression Model? - Towards Data Science. Medium. Retrieved October 21, 2021, from https://towardsdatascience.com/what-are-the-best-metrics-to-evaluate-your-regression-model-418ca481755b

  3. Brownlee, J. (2020, August 20). A Gentle Introduction to the Rectified Linear Unit (ReLU). Machine Learning Mastery. Retrieved October 21, 2021, from https://machinelearningmastery.com/rectified-linear-activation-function-for-deep-learning-neural-networks/

  4. Kingma, D. P., & Ba, J. (2014, December 22). Adam: A Method for Stochastic Optimization. arXiv.org. Retrieved October 23, 2021, from https://arxiv.org/abs/1412.6980